CONVERTS THE ARRANGEMENT OF A COMBINATION OF
BUNDLES OF BLOCKS TO BE PROCESSED INTO THAT
OBTAINED BY ONE-DIMENSIONALLY DIVIDING IT USING
PARALLEL TRANSFER. IN THIS CASE, THE DIAGONAL
BLOCK MUST BE SHARED BY ALL NODES.


APPLIES LU DECOMPOSITION TO THE ONE-DIMENSIONALLY
DIVIDED AND ALLOCATED BLOCKS. IN THIS CASE, BOTH
BLOCKS WITH THE SAME WIDTH AS THE SIDE OF THE
CACHE AND BLOCKS WITH A WIDTH SMALLER THAN IT
ARE SEPARATELY AND RECURSIVELY PROCESSED.
S15



RESTORES THE ARRANGEMENT OBTAINED BY ONE-DIMENSIONALLY
DIVIDING THE LU-DECOMPOSED BLOCK TO THAT OBTAINED BY
TWO-DIMENSIONALLY DIVIDING THE ORIGINAL BLOCK.


AT THIS POINT, A SMALL BLOCK OBTAINED BY ONE-DIMENSIONALLY
DIVIDING THE DIAGONAL BLOCK AND THE REMAINING BLOCKS BY THE
NUMBER OF NODES IS ALLOCATED TO EACH NODE. A BUNDLE OF BLOCKS
IN THE ROW DIRECTION IS UPDATED IN EACH NODE USING THE UPDATED
DIAGONAL BLOCK SHARED BY ALL NODES. IN THIS CASE, A COLUMN
BLOCK NEEDED FOR THE SUBSEQUENT UPDATE IS TRANSFERRED TO AN
ADJACENT NODE SIMULTANEOUSLY WITH COMPUTATION.



REDUNDANTLY ALLOCATES THE LAST BUNDLE TO EACH
NODE WITHOUT DIVIDING IT AND APPLIES LU
DECOMPOSITION TO IT BY EXECUTING THE SAME
COMPUTATION IN EVERY NODE. THEN, THE PORTION
CORRESPONDING TO EACH NODE IS COPIED BACK.



END 
DIAGONAL PORTION



TRANSFER PORTION OTHER THAN 
THE DIAGONAL PORTION 



WORK AREA USED FOR TRANSFER 
A SMALL MATRIX PORTION IS UPDATED BY A MATRIX PRODUCT,
BASED ON INFORMATION ABOUT A ROW BLOCK IN EACH NODE.
BUFFER A



BUFFER B 
FIG. 12



LU DECOMPOSITION (THE SIZE OF THE PROBLEM TO BE SOLVED IS
n=iblksunit x numnord x m, WHERE iblksunit, numnord AND m ARE THE
WIDTH OF A UNIT BLOCK, THE NUMBER OF NODES AND THE NUMBER OF UNIT
BLOCKS, RESPECTIVELY). EACH NODE RECEIVES, AS ARGUMENTS, COMMON
MEMORY AREA A(k, n/numnord) (k>=n), OBTAINED BY TWO-DIMENSIONALLY
AND EQUALLY DIVIDING THE COEFFICIENT MATRIX, AND ip(n), WHICH STORES
THE HISTORY OF ROW REPLACEMENT.



S20 




SET A PROCESS NUMBER (1 THROUGH NUMBER OF NODES) IN nonord. 
SET THE NUMBER OF NODES (TOTAL NUMBER OF PROCESSES) 
IN numnord. 



S21



GENERATE THREADS IN EACH NODE. SET THE THREAD NUMBER WITHIN
EACH NODE AND THE TOTAL NUMBER OF THREADS IN nothrd AND
numthrd, RESPECTIVELY.



S22 



SET BLOCK WIDTH iblksmacro=iblksunit x numnord,
loop=n/(iblksunit x numthrd)-1 (NUMBER OF REPETITIONS),
i=1 AND lenbufmax=(n-iblksmacro)/numnord+iblksmacro.



S23 



SECURE THE FOLLOWING WORK AREAS:
wlu1(lenbufmax, iblksmacro), wlu2(lenbufmax, iblksmacro),
bufs(lenbufmax, iblksunit), bufd(lenbufmax, iblksunit).
A SUB-ROUTINE COMPUTES THE ACTUAL LENGTH lenbuf AT EACH TIME
OF EXECUTION AND USES THE NECESSARY SIZE OF THIS AREA.
ESTABLISH BARRIER SYNCHRONIZATION AMONG NODES.

lenblks=(n-i x iblksmacro)/numnord+iblksmacro
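
The work-area sizing in the boxes above can be checked with a short sketch. This is an illustration only, written in Python rather than the flowchart's Fortran-style notation; the function names are hypothetical and integer division is assumed wherever the flowchart writes "/".

```python
def work_area_sizes(n, numnord, iblksunit):
    """Return (iblksmacro, lenbufmax) as set before the main loop."""
    iblksmacro = iblksunit * numnord
    lenbufmax = (n - iblksmacro) // numnord + iblksmacro
    return iblksmacro, lenbufmax


def lenblks(n, numnord, iblksunit, i):
    """Rows of the work area actually used in main-loop iteration i (1-based)."""
    iblksmacro = iblksunit * numnord
    return (n - i * iblksmacro) // numnord + iblksmacro
```

Under these assumptions the first iteration uses the whole buffer and every later iteration uses less, which is why the work areas are secured once with length lenbufmax.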



S28 



S27

CALL A SUB-ROUTINE ctob AND MODIFY THE ARRANGEMENT OF EACH
NODE BY COMBINING A DIAGONAL BLOCK IN THE i-TH BLOCK IN EACH
NODE WITH A BLOCK WITH WIDTH iblksmacro OBTAINED BY
ONE-DIMENSIONALLY DIVIDING THE BLOCK TO BE PROCESSED.



ESTABLISH BARRIER SYNCHRONIZATION AMONG NODES. 



S29



CALL A SUBROUTINE interLU
AND APPLY LU DECOMPOSITION TO THE
BLOCK THAT IS STORED, DISTRIBUTED AND
ALLOCATED IN ARRAY wlu1. INFORMATION
ABOUT ROW REPLACEMENT IS STORED IN
ip(is:ie) AS is=(i-1)*iblksmacro+1,
ie=i*iblksmacro.
FIG. 13

ESTABLISH BARRIER SYNCHRONIZATION AMONG NODES.
CALL A SUBROUTINE btoc AND RESTORE THE BLOCK
LU-DECOMPOSED USING THE RE-LOCATED BLOCK
TO THE ORIGINAL PLACE IN EACH NODE.



S32



ESTABLISH BARRIER 
SYNCHRONIZATION AMONG NODES. 



S33



CALL A SUB-ROUTINE exrw AND PERFORM BOTH
THE REPLACEMENT OF ROWS AND THE UPDATE
OF THE ROW BLOCK.



S34



ESTABLISH BARRIER 
SYNCHRONIZATION AMONG NODES. 
CALL A SUB-ROUTINE nmcbt AND UPDATE THE RE-ALLOCATED
LU-DECOMPOSED BLOCK USING A MATRIX PRODUCT OF A COLUMN
BLOCK PORTION AND A ROW BLOCK PORTION IN EACH NODE. UPDATE
IT WHILE PREPARING THE SUBSEQUENT UPDATE BY TRANSFERRING THE
ROW BLOCK PORTION AMONG THE PROCESSORS ALONG A RING
SIMULTANEOUSLY WITH COMPUTATION.
ESTABLISH BARRIER
SYNCHRONIZATION AMONG NODES.



DELETE THE GENERATED 
THREADS. 
CALL A SUBROUTINE [...]lu AND UPDATE THE LAST BLOCK
WHILE APPLYING LU DECOMPOSITION TO IT.
ESTABLISH BARRIER
SYNCHRONIZATION AMONG NODES. 







i=i+1

S45



RECEIVE A(k, n/numnord), wlu1(lenblks, iblksmacro),
bufs(lenblks, iblksunit) AND bufd(lenblks, iblksunit) AS
ARGUMENTS, AND REPLACE THE ARRANGEMENT OF A BLOCK OBTAINED BY ADDING
A BLOCK THAT IS OBTAINED BY DIVIDING A PORTION UNDER THE DIAGONAL
BLOCK MATRIX PORTION OF A BUNDLE OF numnord OF THE i-TH BLOCKS WITH
WIDTH iblksunit IN EACH NODE BY numnord TO THE DIAGONAL BLOCK WITH
THAT OF THE BLOCK DISTRIBUTED AND ALLOCATED TO EACH NODE.



S46



EXECUTE nbase=(i-1)*iblksmacro
(i: THE NUMBER OF REPETITIONS OF A CALLING SOURCE MAIN
LOOP), ibs=nbase+1, ibe=nbase+iblksmacro,
len=(n-ibe)/numnord, nbase2d=(i-1)*iblksunit,
ibs2d=nbase2d+1 AND ibe2d=ibs2d+iblksunit. THE NUMBER OF
TRANSMITTING DATA IS lensend=(len+iblksmacro)*iblksunit.




S49 



DETERMINE A TRANSMITTING PORTION AND A RECEIVING
PORTION. SPECIFICALLY, COMPUTE
idst=mod(nonord-1+iy-1, numnord)+1
(TRANSMITTING DESTINATION NODE NUMBER) AND
isrs=mod(nonord-1+numnord-iy+1, numnord)+1
(TRANSMITTING SOURCE NODE NUMBER).
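
The partner computation above can be sketched as follows. This is a minimal Python illustration, not the patent's code; the function name partners is hypothetical, and node numbers are 1-based as in the flowchart.

```python
def partners(nonord, iy, numnord):
    """Return (idst, isrs) for node nonord at exchange step iy (both 1-based)."""
    idst = (nonord - 1 + iy - 1) % numnord + 1
    isrs = (nonord - 1 + numnord - iy + 1) % numnord + 1
    return idst, isrs
```

At step iy every node sends to the node iy-1 places ahead and receives from the node iy-1 places behind, so each step is a permutation of the nodes and no two nodes send to the same destination. At iy=1 each node "exchanges" with itself.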



S50



THE DIAGONAL BLOCK PORTION WITH WIDTH iblksunit ALLOCATED TO
EACH NODE AND A PORTION THAT IS OBTAINED BY ONE-DIMENSIONALLY
DIVIDING THE BLOCKS UNDER IT BY numnord AND THAT IS STORED WHEN
RE-ALLOCATED (TRANSFER DESTINATION PORTION LOCATED IN THE ORDER
OF THE NODE NUMBERS) ARE STORED IN THE LOWER SECTION OF THE
BUFFER. SPECIFICALLY, EXECUTE
bufd(1:iblksmacro, 1:iblksunit) <- A(ibs:ibe, ibs2d:ibe2d),
icps=ibe+(idst-1)*len+1, icpe=icps+len-1,
bufd(iblksmacro+1:len+iblksmacro, 1:iblksunit) <-
A(icps:icpe, ibs2d:ibe2d). THE COMPUTED RESULT IS COPIED
IN PARALLEL IN EACH THREAD BY ONE-DIMENSIONALLY
DIVIDING IT BY THE NUMBER OF THREADS.
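
The row ranges packed into the send buffer can be sketched as below. This is an illustrative Python helper with a hypothetical name; it assumes integer division and reads the end index as icps+len-1 (the start index plus the strip length), which is an interpretation of the printed formula.

```python
def ctob_ranges(i, n, numnord, iblksunit, idst):
    """Return ((ibs, ibe), (icps, icpe)): the diagonal-block rows and the
    rows of the strip below it that are sent to node idst (all 1-based)."""
    iblksmacro = iblksunit * numnord
    nbase = (i - 1) * iblksmacro
    ibs, ibe = nbase + 1, nbase + iblksmacro
    length = (n - ibe) // numnord          # "len" in the flowchart
    icps = ibe + (idst - 1) * length + 1
    icpe = icps + length - 1               # assumed reading of the end index
    return (ibs, ibe), (icps, icpe)
```

For n=24, numnord=2 and iblksunit=2, iteration i=1 gives diagonal rows 1 to 4, with rows 5 to 14 going to node 1 and rows 15 to 24 going to node 2, so the strips tile everything below the diagonal block.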



S51 






THE COMPUTED RESULT IS TRANSMITTED/RECEIVED (ALL
NODES TRANSMIT). SPECIFICALLY, THE CONTENTS OF bufd
ARE TRANSMITTED TO THE idst-TH NODE
AND ARE RECEIVED IN bufs.



FIG. 15



WAIT FOR THE COMPLETION OF THE 
TRANSMISSION/ RECEPTION. 



ESTABLISH BARRIER 
SYNCHRONIZATION 



STORE THE DATA RECEIVED FROM THE isrs-TH
NODE IN A CORRESPONDING POSITION IN wlu1.
SPECIFICALLY, EXECUTE
icp2ds=(isrs-1)*iblksunit+1,
icp2de=icp2ds+iblksunit-1,
wlu1(1:len+iblksmacro, icp2ds:icp2de) <-
bufs(1:len+iblksmacro, 1:iblksunit).
THE COMPUTED RESULT IS COPIED IN PARALLEL IN
EACH THREAD BY ONE-DIMENSIONALLY DIVIDING IT
BY THE NUMBER OF THREADS.
iy=iy+1



return 
S64



RECEIVE A(k, n/numnord), wlu1(lenblks, iblksmacro)
AND wlumicro(ncash) AS ARGUMENTS (THE SIZE OF
wlumicro IS THE SAME AS THAT OF THE L2 CACHE).




EXECUTE iblksmicro=nwidthmicro AND DESIGNATE AS diag
THE DIAGONAL PORTION
wlu(istmicro:istmicro+iblksmicro-1, istmicro:istmicro+iblksmicro-1)
OF PORTION
wlu(istmicro:lenmacro, istmicro:istmicro+iblksmicro-1)
OF wlu(lenmacro, iblksmacro), IN WHICH THE DIAGONAL BLOCK
LOCATED IN THE SHARED AREA OF EACH NODE AND THE DIVIDED
BLOCK ARE STORED.
EXECUTE irest=istmicro+iblksmicro.

COMBINE A BLOCK OBTAINED BY ONE-DIMENSIONALLY AND EQUALLY
DIVIDING wlu(irest:lenmacro, istmicro:istmicro+iblksmicro-1)
BY THE NUMBER OF THREADS WITH diag AND COPY IT TO AN
AREA wlumicro IN EACH THREAD. SPECIFICALLY, EXECUTE
lenmicro=(lenmacro-irest+numthrd)/numthrd AND COPY IT IN
wlumicro(lenmicro+iblksmicro, iblksmicro) TO OBTAIN
lenblksmicro=lenmicro+iblksmicro.
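
The expression for lenmicro is a ceiling division of the rows irest..lenmacro by the number of threads. A minimal Python sketch follows; the function name and the per-thread ranges are assumptions made for illustration, since the flowchart gives only lenmicro itself.

```python
def thread_rows(lenmacro, irest, numthrd):
    """Return lenmicro and an assumed (start, end) row range per thread."""
    # (rows + numthrd - 1) // numthrd with rows = lenmacro - irest + 1
    lenmicro = (lenmacro - irest + numthrd) // numthrd
    ranges = []
    for t in range(numthrd):
        start = irest + t * lenmicro
        end = min(lenmacro, start + lenmicro - 1)   # last thread may get fewer rows
        ranges.append((start, end))
    return lenmicro, ranges
```

With lenmacro=14, irest=5 and numthrd=4 there are ten rows to share, so lenmicro is 3 and the last thread receives the single remaining row.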




RETURN THE DIAGONAL PORTION OF THE PORTION DIVIDED 
AND ALLOCATED TO wlumicro, FROM THE wlumicro OF ONE 
THREAD TO THE ORIGINAL PLACE OF wlu, AND RETURN THE 

OTHER PORTION OF THE PORTION DIVIDED AND ALLOCATED 
TO wlumicro, FROM THE wlumicro OF EACH THREAD TO THE 
ORIGINAL PLACE OF wlu. 






FIG. 17 



S65 



DETERMINE WHETHER
nwidthmicro>=3*iblksmicromax OR
nwidthmicro<=2*iblksmicromax.



EXECUTE nwidthmicro2=nwidthmicro/2,
istmicro2=istmicro+nwidthmicro2 AND
nwidthmicro3=nwidthmicro-nwidthmicro2.



S67 



EXECUTE nwidthmicro2=nwidthmicro/3,
istmicro2=istmicro+nwidthmicro2 AND
nwidthmicro3=nwidthmicro-nwidthmicro2.



S68 



S69



S70



CALL A SUB-ROUTINE interLU
BY GIVING istmicro AND
nwidthmicro2 AS istmicro
AND nwidthmicro.



UPDATE PORTION
wlu(istmicro:istmacro+nwidthmicro2-1,
istmacro+nwidthmicro2:istmacro+nwidthmicro-1)
(SUFFICIENT IF THIS IS UPDATED IN ONE THREAD).
THIS IS UPDATED BY MULTIPLYING TO IT THE INVERSE
MATRIX OF THE LOWER TRIANGULAR MATRIX OF
wlu(istmicro:istmacro+nwidthmicro2-1,
istmicro:istmacro+nwidthmicro2-1) FROM LEFT.



UPDATE wlu(istmicro2:lenmacro,
istmicro2:istmicro2+nwidthmicro3-1) BY SUBTRACTING
wlu(istmicro2:lenmacro, istmicro:istmicro2-1) x
wlu(istmicro:istmacro+nwidthmicro2-1,
istmacro+nwidthmicro2:istmacro+nwidthmicro-1) FROM
IT. IN THIS CASE, THEY ARE COMPUTED IN PARALLEL BY
ONE-DIMENSIONALLY AND EQUALLY DIVIDING IT BY THE
NUMBER OF THREADS.
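
The recursive structure of these boxes (decompose the left part, solve for the upper-right part with the lower triangle, subtract the matrix product from the lower-right part, then continue on the right part) can be sketched in a few lines. This is a simplified single-thread Python illustration under stated assumptions: no pivoting, a square matrix held as a list of lists, a unit lower triangle, halving only (the divide-by-three branch is omitted), and 0-based indices. The names lu_recursive and multiply_lu are hypothetical.

```python
def lu_recursive(a, ist, nwidth, max_width):
    """In-place LU (no pivoting) of columns ist..ist+nwidth-1, rows ist..end."""
    nrows = len(a)
    iend = ist + nwidth                      # one past the last column handled here
    if nwidth <= max_width:
        # Narrow enough: plain column-by-column elimination inside the panel.
        for j in range(ist, iend):
            for r in range(j + 1, nrows):
                a[r][j] /= a[j][j]
                for c in range(j + 1, iend):
                    a[r][c] -= a[r][j] * a[j][c]
        return
    nwidth2 = nwidth // 2
    ist2 = ist + nwidth2
    lu_recursive(a, ist, nwidth2, max_width)           # left part
    # Upper-right part: multiply by the inverse of the unit lower triangle.
    for c in range(ist2, iend):
        for r in range(ist + 1, ist2):
            a[r][c] -= sum(a[r][k] * a[k][c] for k in range(ist, r))
    # Lower-right part: subtract (lower-left part) x (upper-right part).
    for r in range(ist2, nrows):
        for c in range(ist2, iend):
            a[r][c] -= sum(a[r][k] * a[k][c] for k in range(ist, ist2))
    lu_recursive(a, ist2, nwidth - nwidth2, max_width)  # right part


def multiply_lu(a):
    """Rebuild L x U from the packed factors (L has an implicit unit diagonal)."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            for k in range(min(r, c) + 1):
                lower = 1.0 if k == r else a[r][k]
                out[r][c] += lower * a[k][c]
    return out
```

Calling lu_recursive(a, 0, len(a), max_width) with different max_width values should give the same factors up to rounding; only the order of operations, and hence the cache behaviour, changes.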




RECEIVE A(k, n/numnord), wlu1(lenblks, iblksmacro) AND
wlumicro(lenblksmicro, iblksmicro) AS ARGUMENTS (THE SIZE OF
wlumicro IS THE SAME AS THAT OF THE L2 CACHE AND wlumicro IS
SECURED IN EACH THREAD). APPLY LU DECOMPOSITION TO THE PORTION
STORED IN wlumicro BY SUB-ROUTINE LUmicro.
ist: THE LEADING POSITION OF A BLOCK TO BE LU-DECOMPOSED,
WHICH IS INITIALLY "1".
nwidth: THE WIDTH OF A BLOCK, WHICH IS INITIALLY
THE WIDTH OF THE ENTIRE BLOCK.
iblksmax: THE MAXIMUM NUMBER OF DIVIDED BLOCKS (APPROXIMATELY 8).
A BLOCK IS NEVER DIVIDED BY A LARGER NUMBER THAN THIS.



S7 




DETECT THE i-TH ELEMENT WITH THE MAXIMUM ABSOLUTE
VALUE IN EACH THREAD, AND STORE IT IN THE COMMON
MEMORY AREA IN ORDER OF THREAD NUMBERS.



DETECT THE MAXIMUM PIVOT IN THE NODE FROM THE ELEMENTS.
THEN, DETERMINE THE MAXIMUM PIVOT IN ALL NODES IN EACH NODE
BY COMMUNICATING, IN SUCH A WAY THAT EACH NODE HAS EACH SET
OF THIS ELEMENT, ITS NODE NUMBER AND ITS POSITION (THE
MAXIMUM PIVOT IS DETERMINED IN EACH NODE BY THE SAME
METHOD).
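
The two-level pivot search (per thread, then per node, then across nodes) can be sketched as a plain reduction. This Python illustration replaces the inter-node communication with a loop over an in-memory list; the data layout, the function names and the tie-breaking rule (the lowest node and position win) are assumptions.

```python
def thread_max(entries):
    """entries: non-empty list of (position, value); return the pair with the largest |value|."""
    return max(entries, key=lambda e: abs(e[1]))


def node_max(threads):
    """threads: one entries list per thread, in thread-number order."""
    return thread_max([thread_max(t) for t in threads])


def global_max(nodes):
    """nodes: one threads list per node; return (node number, position, value)."""
    candidates = [(node_no,) + node_max(threads)
                  for node_no, threads in enumerate(nodes, start=1)]
    return max(candidates, key=lambda c: abs(c[2]))
```

Because every node applies the same rule to the same set of candidates, every node arrives at the same pivot without a further broadcast.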
S83



INDEPENDENTLY REPLACE PIVOTS 
IN EACH THREAD SINCE THIS IS 
REPLACEMENT IN THE DIAGONAL 
BLOCK STORED IN ALL NODES 
AND THAT IN THE DIAGONAL 
BLOCK SHARED BY ALL THREADS. 
THE REPLACED POSITIONS ARE 
STORED IN ARRAY ip. 




INDEPENDENTLY REPLACE
PIVOTS IN EACH NODE.
SPECIFICALLY, STORE A
PIVOT ROW TO BE
REPLACED IN A COMMON
AREA AND REPLACE IT
WITH THE DIAGONAL
BLOCK PORTION OF
EACH THREAD. STORE
THE REPLACED POSITION
IN ARRAY ip.



COPY A ROW VECTOR
TO BE REPLACED FROM
A NODE HAVING THE
MAXIMUM PIVOT BY
INTER-NODE
COMMUNICATION.
THEN, REPLACE
THE PIVOT ROW.

UPDATE THE ROW



UPDATE THE UPDATE PORTIONS 
OF THE i-TH COLUMN AND ROW. 
i=i+1



FIG. 19 
S89



EXECUTE nwidth2=nwidth/2,
ist2=ist+nwidth2 AND
nwidth3=nwidth-nwidth2.



EXECUTE nwidth2=nwidth/3,
ist2=ist+nwidth2 AND
nwidth3=nwidth-nwidth2.

S92




UPDATE PORTION
wlumicro(istmicro:istmacro+nwidth2-1,
istmicro+nwidth2:istmicro+nwidthmicro-1) BY
MULTIPLYING TO IT THE INVERSE
MATRIX OF THE LOWER TRIANGULAR MATRIX OF
wlumicro(istmicro:istmacro+nwidth2-1,
istmicro:istmacro+nwidth2-1) FROM LEFT.



UPDATE
wlumicro(istmicro2:lenmacro, istmicro2:istmicro2+nwidthmicro3-1)
BY SUBTRACTING
wlumicro(istmicro2:lenmacro, istmicro:istmicro2-1) x
wlumicro(istmicro:istmacro+nwidth2-1,
ist+nwidth2:ist+nwidthmicro-1) FROM IT.



RECEIVE A(k, n/numnord), wlu1(lenblks, iblksmacro), bufs(lenblks, iblksunit) AND
bufd(lenblks, iblksunit) AS ARGUMENTS, AND REPLACE THE ARRANGEMENT OF A BLOCK
OBTAINED BY ADDING A BLOCK THAT IS OBTAINED BY DIVIDING A PORTION UNDER THE
DIAGONAL BLOCK MATRIX PORTION iblksmacro x iblksmacro OF A BUNDLE OF numnord OF
THE i-TH BLOCKS WITH WIDTH iblksunit IN EACH NODE BY numnord TO THE DIAGONAL
BLOCK WITH THAT OF THE BLOCK DISTRIBUTED AND ALLOCATED TO EACH NODE.




EXECUTE nbase=(i-1)*iblksmacro (i: THE NUMBER OF REPETITIONS OF A
CALLING SOURCE MAIN LOOP), ibs=nbase+1, ibe=nbase+iblksmacro,
len=(n-ibe)/numnord, nbase2d=(i-1)*iblksunit, ibs2d=nbase2d+1 AND
ibe2d=ibs2d+iblksunit. THE NUMBER OF TRANSMITTING DATA IS
lensend=(len+iblksmacro)*iblksunit.
DETERMINE A TRANSMITTING PORTION AND A
RECEIVING PORTION. SPECIFICALLY, COMPUTE
idst=mod(nonord-1+iy-1, numnord)+1 AND
isrs=mod(nonord-1+numnord-iy+1, numnord)+1.



TRANSFER THE COMPUTED RESULT FROM wlu1 TO THE BUFFER AND STORE IT THERE
TO BE TRANSMITTED TO RESTORE THE ARRANGEMENT OF BLOCKS TO THE ORIGINAL ONE.
SPECIFICALLY, EXECUTE
icp2ds=(idst-1)*iblksunit+1, icp2de=icp2ds+iblksunit-1,
bufd(1:len+iblksmacro, 1:iblksunit) <- wlu1(1:len+iblksmacro, icp2ds:icp2de).
ONE-DIMENSIONALLY DIVIDE THE COMPUTED RESULT BY THE NUMBER OF THREADS
AND COPY IT TO EACH NODE IN PARALLEL.



S10&-v_ 






TRANSMIT/RECEIVE THE COMPUTED RESULT
(ALL NODES TRANSMIT). SPECIFICALLY, TRANSMIT
THE CONTENTS OF bufd TO THE idst-TH NODE
AND RECEIVE IT IN bufs.
WAIT FOR THE COMPLETION OF
THE TRANSMISSION/RECEPTION.

ESTABLISH BARRIER SYNCHRONIZATION.



STORE THE DIAGONAL BLOCK PORTION WITH WIDTH iblksunit ALLOCATED
TO EACH NODE AND THE PORTION REPLACED WITH THE PORTION OBTAINED
BY ONE-DIMENSIONALLY DIVIDING A BLOCK LOCATED UNDER IT BY
numnord (PORTION LOCATED IN ORDER OF THE NUMBER OF TRANSFER
DESTINATION NODES) IN THEIR ORIGINAL POSITIONS. EXECUTE
A(ibs:ibe, ibs2d:ibe2d) <- bufs(1:iblksmacro, 1:iblksunit),
icps=ibe+(isrs-1)*len+1, icpe=icps+len-1,
A(icps:icpe, ibs2d:ibe2d) <-
bufs(iblksmacro+1:len+iblksmacro, 1:iblksunit).
THE COMPUTED RESULT IS ONE-DIMENSIONALLY DIVIDED BY THE NUMBER
OF THREADS AND IS COPIED FOR EACH COLUMN IN EACH THREAD.



Sli( 



iy=iy+1

FIG. 21






return 



RECEIVE A(k, n/numnord) AND wlu1(lenblks, iblksmacro)
AS ARGUMENTS. THE LU-DECOMPOSED DIAGONAL PORTION IS
STORED IN wlu1(1:iblksmacro, 1:iblksmacro) AND IS
SHARED BY ALL NODES. nbdiag=(i-1)*iblksmacro (i: THE
NUMBER OF REPETITIONS OF THE MAIN LOOP OF CALLING
SOURCE SUB-ROUTINE pLU). INFORMATION ABOUT PIVOT
REPLACEMENT IS STORED IN
ip(nbdiag+1:nbdiag+iblksmacro).



S115
EXECUTE nbase=i*iblksunit (i: THE NUMBER
OF REPETITIONS OF THE MAIN LOOP OF CALLING
SOURCE SUB-ROUTINE pLU),
irows=nbase+1, irowe=n/numnord,
len=(irowe-irows+1)/numthrd,
is=nbase+(nothrd-1)*len+1 AND
ie=min(irowe, is+len-1).
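
The per-thread column range computed in this box can be sketched as follows. This is an illustrative Python helper with a hypothetical name; integer division is assumed, and the example values are chosen so that the columns divide evenly among the threads.

```python
def exrw_range(i, n, numnord, numthrd, iblksunit, nothrd):
    """Return (is, ie): the local columns handled by thread nothrd (1-based)."""
    nbase = i * iblksunit
    irows = nbase + 1
    irowe = n // numnord
    length = (irowe - irows + 1) // numthrd    # "len" in the flowchart
    is_ = nbase + (nothrd - 1) * length + 1
    ie = min(irowe, is_ + length - 1)
    return is_, ie
```

For n=24, numnord=2, iblksunit=2 and two threads, iteration i=1 leaves local columns 3 to 12; thread 1 takes columns 3 to 7 and thread 2 takes columns 8 to 12.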
ix=is




S118



nbdiag=(i-1)*iblksmacro, j=nbdiag+1

S122



REPLACE A(j, ix) WITH A(ip(j), ix). 
j=j+1

ix=ix+1
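
The row-replacement loop over j and ix can be sketched as below. This Python illustration makes several assumptions that the boxes do not state: "replace" is read as an exchange of the two elements, j runs over the iblksmacro rows of the current diagonal block, and the loop bounds are inclusive. The function name is hypothetical and the arrays are lists of lists addressed with 1-based indices converted on access.

```python
def apply_row_swaps(a, ip, nbdiag, iblksmacro, is_, ie):
    """Exchange A(j, ix) and A(ip(j), ix) for the columns is_..ie of one thread."""
    for ix in range(is_, ie + 1):
        for j in range(nbdiag + 1, nbdiag + iblksmacro + 1):
            pj = ip[j - 1]                     # pivot row recorded for row j
            a[j - 1][ix - 1], a[pj - 1][ix - 1] = a[pj - 1][ix - 1], a[j - 1][ix - 1]
    return a
```

Because each thread touches only its own columns, the threads can apply the same pivot history independently.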
ESTABLISH BARRIER SYNCHRONIZATION
(OF ALL NODES AND ALL THREADS).



UPDATE A(nbdiag+1:nbdiag+iblksmacro, is:ie) <-
TRL(wlu1(1:iblksmacro, 1:iblksmacro))^-1 x
A(nbdiag+1:nbdiag+iblksmacro, is:ie) IN ALL
NODES AND THREADS. TRL(B) REPRESENTS
THE LOWER TRIANGULAR PORTION OF MATRIX B.
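
Multiplying by the inverse of a lower triangular matrix is normally carried out as a forward substitution rather than by forming the inverse. A minimal Python sketch follows; it assumes that the lower triangle has an implicit unit diagonal (as is usual when L and U share one array), which the flowchart does not state, and the function name is hypothetical.

```python
def solve_unit_lower(w, b):
    """Overwrite b with L^{-1} b, where L is the strictly lower part of w
    plus an assumed unit diagonal. w is m x m, b is m x k (lists of lists)."""
    m = len(w)
    for col in range(len(b[0])):
        for r in range(1, m):
            b[r][col] -= sum(w[r][k] * b[k][col] for k in range(r))
    return b
```

Each column of the row block is independent, which is why the update can be split among nodes and threads by column range.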



ESTABLISH BARRIER SYNCHRONIZATION 
(OF ALL NODES AND ALL THREADS). 



FIG. 22



return 
RECEIVE wlu1(lenblks, iblksmacro)
AND wlu2(lenblks, iblksmacro) AS ARGUMENTS.
ONE RESULT OF LU-DECOMPOSING A BLOCK WITH WIDTH iblksmacro,
BEING ONE BLOCK OBTAINED BY ONE-DIMENSIONALLY DIVIDING BOTH
THE DIAGONAL BLOCK AND A BLOCK LOCATED UNDER IT BY numnord, IS
STORED IN wlu1. IT IS RE-ALLOCATED IN ITS DIVIDED ORDER IN
CORRESPONDENCE WITH ITS NODE NUMBER. UPDATE THIS WHILE
TRANSFERRING THIS ALONG THE RING OF NODES (TRANSFERRING
SIMULTANEOUSLY WITH COMPUTATION) AND COMPUTING A MATRIX
PRODUCT (SINCE THERE IS NO INFLUENCE ON PERFORMANCE,
A DIAGONAL BLOCK PORTION NOT DIRECTLY USED FOR THE
COMPUTATION IS ALSO TRANSMITTED WHILE COMPUTING).
EXECUTE nbase=(i-1)*iblksmacro (i: THE NUMBER OF REPETITIONS
OF THE MAIN LOOP OF CALLING SOURCE SUB-ROUTINE pLU),
ibs=nbase+1, ibe=nbase+iblksmacro,
len=(n-ibe)/numnord, nbase2d=(i-1)*iblksunit,
ibs2d=nbase2d+1, ibe2d=ibs2d+iblksunit, n2d=n/numnord AND
lensend=len+iblksmacro. THE NUMBER OF TRANSMITTING DATA IS
nwlen=lensend*iblksmacro.
EXECUTE iy=1 (SET AN INITIAL VALUE),
idst=mod(nonord, numnord)+1 (TRANSMITTING DESTINATION
NODE NUMBER (ADJACENT NODE)), isrs=mod(nonord-1+numnord-1,
numnord)+1 (TRANSMITTING SOURCE NODE NUMBER) AND
ibp=idst.
WAIT FOR THE COMPLETION OF
THE TRANSMISSION/RECEPTION.
TRANSMIT/RECEIVE THE COMPUTED RESULT.
SPECIFICALLY, TRANSMIT THE CONTENTS OF wlu1
(INCLUDING THE DIAGONAL BLOCK) TO ITS ADJACENT
NODE (NODE NUMBER idst). ALSO STORE DATA
TRANSMITTED (FROM NODE NUMBER isrs) IN wlu2. THE
TRANSMITTING/RECEIVING DATA LENGTH IS nwlen.
COMPUTE THE POSITION OF UPDATE
USING DATA STORED IN wlu1.
EXECUTE ibp=mod(ibp-1+numnord-1, numnord)+1
AND ncptr=nbe+(ibp-1)*len+1
(ONE-DIMENSIONAL STARTING POSITION).



FIG. 23
WAIT FOR THE COMPLETION OF THE TRANSMISSION/
RECEPTION CONDUCTED SIMULTANEOUSLY WITH THE
COMPUTATION OF A MATRIX PRODUCT.




TRANSMIT/RECEIVE THE COMPUTED RESULT.
SPECIFICALLY, TRANSMIT THE CONTENTS OF wlu1
(INCLUDING THE DIAGONAL BLOCK) TO ITS ADJACENT
NODE (NODE NUMBER idst). ALSO STORE DATA
TRANSMITTED (FROM NODE NUMBER isrs) IN wlu2. THE
TRANSMITTING/RECEIVING DATA LENGTH IS nwlen.
COMPUTE THE POSITION OF UPDATE USING
DATA STORED IN wlu2. EXECUTE
ibp=mod(ibp-1+numnord-1, numnord)+1 AND
ncptr=nbe+(ibp-1)*len+1 (ONE-DIMENSIONAL
STARTING POSITION).
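
The bookkeeping of ibp along the ring can be checked with a small simulation. This Python sketch is an illustration under assumptions: it models only which node's block each node currently holds, it assumes numnord update steps with one transfer to the adjacent node between steps, and the function name is hypothetical.

```python
def simulate_ring(numnord):
    """Return, per node, the list of (ibp, owner of the block held) at each update."""
    holder = {p: p for p in range(1, numnord + 1)}        # each node starts with its own block
    ibp = {p: p % numnord + 1 for p in holder}            # ibp=idst=mod(nonord, numnord)+1
    history = {p: [] for p in holder}
    for _ in range(numnord):
        for p in history:
            ibp[p] = (ibp[p] - 1 + numnord - 1) % numnord + 1   # step back one node
            history[p].append((ibp[p], holder[p]))
        # every node passes its current block to the adjacent node idst
        holder = {p % numnord + 1: owner for p, owner in holder.items()}
    return history
```

In this model ibp always equals the owner of the block a node is holding, so ncptr, which is derived from ibp, points at the rows that block should update, and after numnord steps every node has used every block once.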







return 



RECEIVE A(k, n/numnord) AND wlu1(lenblks,
iblksmacro) OR wlu2(lenblks, iblksmacro) IN
wlux(lenblks, iblksmacro). UPDATE A SQUARE AREA
USING THE ONE-DIMENSIONAL STARTING POSITION ncptr
TRANSFERRED FROM THE CALLING SOURCE. EXECUTE
is2d=i*iblksunit+1, ie2d=n/numnord, len=ie2d-is2d+1,
is1d=ncptr, ie1d=ncptr+len-1 (i: THE
NUMBER OF REPETITIONS OF SUB-ROUTINE pLU).
A(is1d:ie1d, is2d:ie2d)=A(is1d:ie1d, is2d:ie2d)-
wlux(iblksmacro+1:iblksmacro+len, 1:iblksmacro) x
A(is1d-iblksmacro:is1d-1, is2d:ie2d)
(EQUATION 1)


COMPUTE AND ROUND UP THE ROOT OF
THE NUMBER OF THREADS PROCESSING IN PARALLEL.

numroot = int(sqrt(numthrd))
if (sqrt(numthrd)-numroot .ne. 0) numroot=numroot+1
m1=numroot, m2=numroot
mx=m1

m1=mx
mx=mx-1
mm=mx x m2
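
The rounding of the square root and the shrinking of the grid can be sketched as below. This Python illustration uses an exact integer square root instead of the floating-point test in the flowchart; the loop condition (keep reducing m1 while the reduced grid still holds all threads) is an assumption, since the corresponding decision box is not legible, and the function names are hypothetical.

```python
import math


def ceil_sqrt(numthrd):
    """Smallest integer whose square is at least numthrd."""
    numroot = math.isqrt(numthrd)
    if numroot * numroot != numthrd:
        numroot += 1
    return numroot


def thread_grid(numthrd):
    """Return (m1, m2): m2 = ceil_sqrt(numthrd), m1 the smallest count with m1*m2 >= numthrd."""
    m1 = m2 = ceil_sqrt(numthrd)
    while (m1 - 1) * m2 >= numthrd:      # assumed stopping rule
        m1 -= 1
    return m1, m2
```

For six threads this gives a 2 x 3 grid with nothing left over; for seven threads it gives a 3 x 3 grid in which m1*m2-numthrd = 2 rectangles have no thread of their own.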
ONE-DIMENSIONALLY AND EQUALLY DIVIDE AN AREA TO BE
UPDATED. THEN, TWO-DIMENSIONALLY AND EQUALLY DIVIDE
m2 INTO m1*m2 RECTANGLES. ALLOCATE numthrd OF THEM TO
EACH THREAD AND COMPUTE THE CORRESPONDING PORTION OF
EQUATION 1 IN PARALLEL. TWO-DIMENSIONALLY AND
SEQUENTIALLY ALLOCATE THE THREADS IN SUCH A WAY AS
(1,1), (1,2), ..., (1,m2), (2,1), ...




m1*m2-numthrd FROM THE RIGHT END OF THE LAST ROW
OF THE LAST RECTANGLE ARE LEFT NOT UPDATED.

COMBINE THESE RECTANGLES INTO ONE RECTANGLE,
TWO-DIMENSIONALLY DIVIDE IT BY THE NUMBER OF
THREADS AND COMPUTE THE CORRESPONDING PORTION OF
EQUATION 1 IN PARALLEL.
ESTABLISH BARRIER SYNCHRONIZATION
(AMONG THE THREADS).



return

FIG. 25
RECEIVE A(k, n/numnord), wlu1(iblksmacro, iblksmacro),
bufs(iblksmacro, iblksunit) AND bufd(iblksmacro, iblksunit)
AS ARGUMENTS, AND TRANSMIT A NON-ALLOCATED PORTION TO EACH
NODE SO THAT A BUNDLE OF numnord OF THE LAST BLOCK WITH
WIDTH iblksunit OF EACH NODE CAN BE SHARED BY ALL NODES.
AFTER iblksmacro x iblksmacro OF BLOCKS ARE SHARED BY ALL
NODES, APPLY LU DECOMPOSITION TO THE SAME MATRIX IN EACH
NODE. AFTER THE LU DECOMPOSITION IS COMPLETED, COPY BACK
A PORTION ALLOCATED TO EACH NODE.
EXECUTE nbase=n-iblksmacro,
ibs=nbase+1, ibe=n,
len=iblksmacro, nbase2d=(i-1)*iblksunit,
ibs2d=n/numnord-iblksunit+1, ibe2d=n/numnord. THE NUMBER
OF TRANSMITTING DATA IS lensend=iblksmacro*iblksunit.

iy=1
COPY THE COMPUTED RESULT IN THE BUFFER.
SPECIFICALLY, bufd(1:iblksmacro, 1:iblksunit) <-
A(ibs:ibe, ibs2d:ibe2d)
DETERMINE A TRANSMITTING PORTION AND A
RECEIVING PORTION. SPECIFICALLY, EXECUTE
idst=mod(nonord-1+iy-1, numnord)+1 AND
isrs=mod(nonord-1+numnord-iy+1, numnord)+1
TRANSMIT/RECEIVE THE COMPUTED RESULT (ALL NODES
TRANSMIT). SPECIFICALLY, TRANSMIT THE CONTENTS
OF bufd TO THE idst-TH NODE.



RECEIVE IT IN bufs. WAIT FOR THE
COMPLETION OF THE TRANSMISSION/RECEPTION.

ESTABLISH BARRIER SYNCHRONIZATION.



STORE THE COMPUTED RESULT IN THE
CORRESPONDING POSITION OF wlu1. STORE
THE DATA RECEIVED FROM THE isrs-TH NODE.
EXECUTE icp2ds=(isrs-1)*iblksunit+1,
icp2de=icp2ds+iblksunit-1 AND
wlu1(1:iblksmacro, icp2ds:icp2de) <-
bufs(1:iblksmacro, 1:iblksunit).
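
The exchange loop for the last block amounts to an all-gather: after numnord steps every node holds every node's column strip in the slot given by icp2ds. A minimal Python simulation follows; it is an illustration only, each strip is represented by labels instead of matrix data, message passing is replaced by a dictionary, and the function name is hypothetical.

```python
def gather_last_block(numnord, iblksunit):
    """Return, per node, the column labels stored in its wlu1 after all steps."""
    strips = {p: [(p, c) for c in range(1, iblksunit + 1)]
              for p in range(1, numnord + 1)}
    wlu1 = {p: [None] * (numnord * iblksunit) for p in strips}
    for iy in range(1, numnord + 1):
        inbox = {}
        for p in strips:                                   # all nodes transmit
            idst = (p - 1 + iy - 1) % numnord + 1
            inbox[idst] = strips[p]
        for p in strips:                                   # all nodes receive and store
            isrs = (p - 1 + numnord - iy + 1) % numnord + 1
            icp2ds = (isrs - 1) * iblksunit + 1
            wlu1[p][icp2ds - 1:icp2ds - 1 + iblksunit] = inbox[p]
    return wlu1
```

Since all nodes end up with the same iblksmacro-wide block, each can run the same decomposition redundantly and then copy back only its own strip.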
iy=iy+1
ESTABLISH BARRIER SYNCHRONIZATION.



EXECUTE IN PARALLEL THE LU
DECOMPOSITION OF iblksmacro x
iblksmacro IN wlu1 OF EACH
NODE. STORE INFORMATION ABOUT
ROW REPLACEMENT.
AFTER THE LU DECOMPOSITION IS
COMPLETED, COPY BACK THE
COMPUTED RESULT FOR THE
RELEVANT NODE TO THE LAST
BLOCK. SPECIFICALLY, EXECUTE
is=(nonord-1)*iblksunit+1, ie=is+iblksunit-1,
A(ibs:ibe, ibs2d:ibe2d) <-
wlu1(1:iblksmacro, is:ie).



return